DOCUMENT RESUME 



ED 440 149                                            TM 030 761

AUTHOR          Stecher, Brian M.; Barron, Sheila I.
TITLE           Quadrennial Milepost Accountability Testing in Kentucky. CSE
                Technical Report.
INSTITUTION     California Univ., Los Angeles. Center for the Study of
                Evaluation.; Center for Research on Evaluation, Standards,
                and Student Testing, Los Angeles, CA.
SPONS AGENCY    Office of Educational Research and Improvement (ED),
                Washington, DC.
REPORT NO       CSE-TR-505
PUB DATE        1999-06-00
NOTE            40p.
CONTRACT        R305B60002
PUB TYPE        Reports - Research (143)
EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     *Academic Standards; *Accountability; Behavior Patterns;
                Case Studies; Elementary Secondary Education; Models; State
                Programs; *State Standards; Surveys; *Teachers; *Teaching
                Methods; Test Use; *Testing Programs
IDENTIFIERS     *Kentucky; Kentucky Instructional Results Information System



ABSTRACT 



Kentucky has been implementing test-based accountability for 
almost a decade, making it a good site for studying the effects of the 
milepost testing model. In 1996, a study was undertaken of the impact of 
standards-based assessment on classroom practices in Kentucky. Kentucky 
teachers (n=365) were surveyed about their classroom practices and other 
school practices during the 1996-1997 and 1997-1998 school years. Case 
studies of a small group of exemplary teachers were also conducted. The 
1997-1998 survey involved writing and mathematics teachers, both subjects 
that were assessed using portfolios. The study confirms some of the positive 
effects of test-based accountability that have been reported previously, but 
it also reveals previously unexamined negative consequences arising from 
high-stakes tests. On the positive side, teachers reacted to the Kentucky 
Instructional Results Information System by changing their behaviors in ways 
that were consistent with the specific targets of the system. On the negative 
side, teachers focused on the most proximal aspects of the system (tests) 
rather than the more distant goals. The testing and accountability system may 
be leading teachers to a near-sightedness with several consequences. One is 
large swings in exposure to specific subjects from year to year. The focus on 
testing and on milepost grade levels may deflect attention from the 
cumulative nature of education. (Contains 15 tables and 22 references.) (SLD) 

QUADRENNIAL MILEPOST 
ACCOUNTABILITY TESTING IN KENTUCKY¹ 

Brian M. Stecher and Sheila I. Barron 
National Center for Research on Evaluation, 
Standards and Student Testing (CRESST) 
RAND Education 



Abstract 

Kentucky provides an opportunity to study a high-stakes test-based accountability 
system that uses milepost testing to see how schools react to the accountability 
pressures. Kentucky has been in the forefront of the test-based accountability 
movement, and many states have looked to Kentucky when designing their own 
accountability systems. Hence, lessons learned in Kentucky will be immediately 
relevant to other states. 



Background 

The past two decades have seen a dramatic increase in states' use of tests as 
educational policy tools, although researchers have raised questions about the 
validity of scores produced by high stakes state tests and the impact of these tests 
on classroom practices. These questions remain unresolved, in part, because the 
increase in state testing has occurred rapidly and testing practices have changed 
as well. The number of states with mandated student testing programs grew 
from 29 in 1980 to 46 in 1992 (Office of Technology Assessment, 1992). The 
increase in state testing has been accompanied by the introduction of new types 
of assessments, an increase in the stakes attached to scores, and the incorporation 
of tests into formal accountability systems. A recent survey of state assessment 
practices found that 39 states were administering some form of performance 
assessment and six others were planning or developing performance 
assessments; 24 states attached stakes to their tests in the form of student 
recognition, promotion, or graduation; and 40 states used test scores for school 
accountability purposes (Bond et al., 1995). 

1 This project would not have been possible without assistance from the Kentucky Department of 
Education and cooperation from teachers across the state. In particular, we want to acknowledge 
the support of the staff from the Kentucky Department of Education, including Brian Gong, Sue 
Rigney, Starr Lewis, and Jonathan Dings. In addition, we would like to thank the hundreds of 
Kentucky classroom teachers who took the time to complete our survey. 

Our RAND colleagues Susan Weinblatt, Suzanne Perry, and Linda Daly deserve credit for 
coordinating the statewide survey effort, including production, distribution, monitoring, review, 
and data editing. Our thanks to Cathy Krop for assisting with survey design and Tammi Chun for 
coding the open-response questions. 

The growth in state testing has been motivated by two broad goals: to 
produce valid indicators of student and school outcomes and to promote 
improved instruction. However, research suggests that state tests, particularly 
those with high stakes, do not always achieve these goals. High-stakes state 
testing programs, even those employing multiple choice tests, do not always 
produce valid information on students or schools (Linn, Graue, & Saunders, 
1990; Koretz, Linn, Dunbar, & Shepard, 1991). Similarly, high stakes testing can 
have undesirable effects on instructional practice, most notably narrowing of the 
curriculum and undue focus on test-like activities (Kellaghan & Madaus, 1991; 
Shepard & Dougherty, 1991; Smith & Rothenberg, 1991). 

The use of performance tasks, particularly portfolios, exacerbates the 
problem of score validity (Koretz, 1998). For example, studies of portfolio-based 
assessments in Vermont and Kentucky found that student work could not be 
rated reliably and that scores were not valid for their intended purposes (Koretz, 
Klein, McCaffrey, & Stecher, 1993; Hambleton et al., 1995). More recent studies 
have also raised questions about the validity of scores from on-demand 
open-response testing (Koretz & Barron, 1998). 

The effects of performance assessments on instructional practices are more 
complex. One of the rationales for the introduction of performance assessment 
in state testing programs was to signal instructional direction without narrowing 
the curriculum as multiple choice tests had done (Resnick & Resnick, 1992). 
There is some evidence that positive curriculum change has occurred. For 
example, portfolio assessment has led to positive changes in teaching practices 
(Stecher & Herman, 1997). Principals and teachers in Maryland and Kentucky 
generally believe that test-based accountability has at least a small positive impact 
on instruction, and that accountability has caused them to focus on content and 
skills that are assessed (Koretz, Mitchell, Barron, & Keith, 1996; Koretz, Barron, 
Mitchell, & Stecher, 1996). Similarly, teachers in Kentucky report increased focus 
on tested subjects and increased use of practices encouraged by the test reformers 
(Stecher, Barron, Kaganoff, & Goodwin, 1998). To the extent that the goal of test- 
based accountability is to focus on previously neglected content or skills (e.g., 
writing, problem solving), it appears to be generally successful. 

The use of performance assessment, however, has not eliminated all of the 
negative instructional consequences associated with high stakes testing (Koretz, 
Stecher, Klein, & McCaffrey, 1994). For example, teachers in Vermont focused on 
the portfolio scoring rubrics rather than the domains of mathematics the 
assessment was supposed to measure (Stecher & Mitchell, 1995). Similarly, 
teachers in Maryland and Kentucky appear to focus inappropriately on test 
preparation activities that do not generalize to the curriculum as a whole 
(Koretz, Mitchell, Barron, & Keith, 1996; Koretz, Barron, Mitchell, & Stecher, 
1996; Koretz & Barron, 1998). There also is evidence that high stakes performance 
assessments create conflicting pressures on teachers, who have a difficult time 
balancing the need to produce high scores with the desire to incorporate more 
authentic activities into their lessons (Borko & Elliott, 1999; Wolf & McIver, 
1999). 

Test-Based Accountability 

Recently, a number of states have adopted formal school accountability 
systems that rely heavily on test scores. The popularity of formal test-based 
accountability is growing because it is seen by many as a relatively quick, 
relatively inexpensive, and highly visible way to bring about changes in schools. 
Policymakers hope such systems will encourage educational improvement by 
sending strong, clear signals to schools about their success. 

However, the research on testing effects suggests that the manner in which 
an assessment system is structured will affect its utility as a tool for program 
improvement (Linn, 1999). For example, choices between high and low stakes, and 
between multiple-choice and performance assessment, will have consequences in terms of 
teachers' behaviors and students' scores. In the accountability context, we would 
expect to find, further, that choices about grade levels and subject matter will 
make a difference, as will the nature of the adopted standards and the rewards or 
sanctions associated with performance. The present study examines changes in 
classroom practices in a state with a sophisticated test-based accountability system 



that measures performance in selected subjects in selected "milepost" grades 
chosen from each school level (elementary, middle, and high school). 

Before describing the study, we want to clarify what is meant by an 
accountability system. In simple terms, accountability is a relationship between 
two parties in which one party is expected to accomplish a particular goal and the 
other party is expected to provide benefits when the goal is accomplished (Hill & 
Bonan, 1991). A state accountability system is somewhat more complex because 
more parties are involved, goals are more abstract, and the flow of information is 
formalized. 

Figure 1 is a model of a test-based state accountability system. The system 
involves relationships among four parties: state policymakers, school personnel, 
students, and the public. State policymakers establish goals or expectations for 
students. Increasingly these goals take the form of performance standards that are 
adopted by state Boards of Education (Association of California School 
Administrators, 1996). It is important to note that most state standards are 
written at a high level of generality (e.g., communicate mathematics concepts 
effectively) and do not offer specific guidance about the content of lessons or 
the methods of instruction. 

Figure 1. Test-based State Accountability System 

Schools provide the educational services that help students achieve the 
desired goals. School administrators set local policies and teachers implement 
specific classroom practices to promote student achievement. As a result of their 
classroom experiences, students acquire knowledge, master skills and develop 
attitudes toward learning. These student outcomes are compared to the standards 
to determine whether schools have been successful. Information about school 
performance is reported to the schools and to the general public. Schools enact 
changes based on these reports to improve the services they provide and 
enhance student outcomes. In addition, parents and community members may 
informally endorse or criticize the schools on the basis of the public information. 
In some systems, the state adds formal consequences. Successful schools are 
rewarded through recognition or financial incentives; unsuccessful schools are 
given assistance to help them improve. Under extreme circumstances, 
chronically unsuccessful schools may be reconstituted. 

Tests play a critical role in this system. Students' normal school output 
cannot be easily translated into the language of the state standards. It is not 
possible to tell from grades, homework, and classroom work products whether 
students have achieved the desired performance standards. As a result, states 
create testing systems to measure student performance in ways that can be more 
easily judged against the standards. The school's accomplishments are measured 
and reported publicly using test results as indirect indicators of student 
accomplishment. The test defines the specific aspects of student performance that 
will be measured and reported. 

However, practical constraints limit the amount and extent of testing that 
can be conducted. Consequently, test-based accountability systems actually 
measure a limited number of domains using a limited amount of data. In 
general, the selection of tests represents a compromise between practical 
considerations, political considerations, and the broad goals of the system. 



Test results play a key role in the schools' responses to the accountability 
information, as well. The staff responds to published test results, rewards or 
sanctions, and other feedback by reinforcing practices perceived to be successful 
and modifying practices perceived to be unsuccessful. Ideally, the feedback will 
exert pressure on schools and teachers to make changes that will help students 
master the standards. Staff will focus their actions on promoting the broad goals 
endorsed by the state and student performance will improve. Evidence cited 
above shows that teachers do change their behaviors in response to such 
feedback. 

Given limited time and resources, however, schools often direct their 
attention more narrowly to practices that will enhance student performance on 
the tests. This is one way in which the discrepancy between broad goals and 
specific measures may reduce the effectiveness of a test-based accountability 
system. 

Of particular concern in this study is the use of a system that only tests 
certain subjects and only tests in selected, milepost grade levels. To minimize 
costs and testing burden, many states follow the example of NAEP and test at one 
grade level each in elementary, middle, and high school. This pattern lowers the 
testing burden on schools and the expense of the testing program. (Concerns 
about cost and testing burden are heightened with tests that contain open- 
response questions, which are more time-consuming to administer and to score.) 
Policymakers assume that local schools will translate information in the 
milepost grades into proper guidance for other grade levels. For example, the 
school would work backwards from fourth-grade objectives (if that is the 
accountability grade) to develop precursor skills and objectives in third, second, 
and first grade. They might also develop benchmarks for performance at the 
earlier grade levels. However, a more limited response would be one that 
focused narrowly on certain grades, subjects, topics within subjects, and 
achievement standards, but not the full range. 

Test-Based Accountability in Kentucky 

Kentucky provides an opportunity to study a high-stakes test-based 
accountability system that uses milepost testing to see how schools react to the 
accountability pressures. Kentucky has been in the forefront of the test-based 
accountability movement, and many states have looked to Kentucky when 



designing their own accountability systems. Hence, lessons learned in Kentucky 
will be immediately relevant to other states. 

The accountability model in Figure 1 fits the Kentucky system quite well, 
and we will briefly review each of the components. The Kentucky Educational 
Reform Act of 1990 established six broad goals for Kentucky schools. For example, 
the goal that relates to academic achievement states, "Schools shall develop their 
students' abilities to use basic communication and mathematics skills for 
purposes and situations they will encounter throughout their lives" (Kentucky 
Department of Education, 1994, p. 2). A task force formed by the Kentucky 
Department of Education elaborated these into more detailed Academic 
Expectations that described what students should be able to do at the conclusion 
of their education. For example, "1.11: Students write using appropriate forms, 
conventions and styles to communicate ideas and information to different 
audiences for different purposes" (Kentucky Department of Education, 1994, p. 2). 
Later, KDE added further clarification in a document called Transformations: 
Kentucky's Curriculum Framework (Kentucky Department of Education, 1995). 
This included general descriptors of what would be appropriate at the 
elementary, middle school and secondary levels. In the case of writing, there are 
six elementary demonstrators, ranging from "Express thought/ideas through 
verbal and/or symbolic representation (e.g., pictures, scribbles, words)" to 
"Establish and use criteria for effective writing to evaluate own and others' 
writing" (Kentucky Department of Education, 1995, p. 26). Further clarification 
was issued in 1996 in a document called Core Content for Assessment (Kentucky 
Department of Education, 1996), but these remain general descriptions. 

Kentucky developed its own assessment system to measure progress 
toward meeting its Academic Expectations. The Kentucky assessment system is 
perhaps the most elaborate in the country.² At the time this study was 
conducted, seven subjects were tested, and each was measured in one elementary 
grade level, one middle school grade, and one high school grade. The Kentucky 
assessment emphasized performance assessments, including portfolios and 
open-response measures. To reduce pressure on individual students and 
emphasize improvement as a whole-school activity, the system only reported 
data at the school level. The school accountability index also included 
non-cognitive measures (including attendance, drop-out rates, etc.), although they 
contributed just one-sixth of the total score and showed very little variability. 

2 Until recently, the Kentucky assessment system was known as the Kentucky Instructional Results 
Information System (KIRIS), and many people are familiar with this acronym. In 1998 the system 
was reformed and the name was changed to the Commonwealth Accountability Testing System 
(CATS). 

The Kentucky system has undergone many changes over the years. 
Originally all subjects were assessed in three grades, but the burden of portfolios 
and constructed-response items was so great that some of the testing was shifted to 
adjacent grades. In 1996-97, four subjects were tested in grades 4, 7, and 11. Three 
others were tested in grades 5, 8, and 11. More recently, multiple-choice tests with 
individual scores have been added. 

The Kentucky accountability system provides both informational feedback 
and consequences. In fact, Kentucky is a good example of an accountability 
system with high stakes for schools. Scores are published and widely 
disseminated. Schools can receive financial rewards to be distributed among the 
staff if student scores on the state assessment exceed improvement targets, and 
the schools are subject to review and external intervention if students' 
performance is consistently poor. The level of performance needed to receive 
rewards is tied to continual improvement, not absolute attainment. This 
approach puts pressure on all schools because even those that are scoring the 
highest have to show improvement in the next accountability cycle. 

Finally, Kentucky provides extensive professional development 
opportunities to help schools and teachers improve student performance. A 
network of regional centers was created to help teachers understand the academic 
expectations and the new assessment system and to integrate them into 
instruction. 

Kentucky has been implementing test-based accountability for almost a 
decade, making it a good site for studying the effects of the milepost testing 
model. Teachers have had ample opportunity to become familiar with the 
Academic Expectations and the format in which they are measured. Kentucky 
provides a unique opportunity to see the degree to which accountability reactions 
generalize beyond the tested grade level, and to see whether teachers respond to 
the narrow signals of specific tests or the larger targets embodied in the academic 
standards. 



Procedures 



In 1996, RAND undertook a two-year study of the impact of standards-based 
assessment on classroom practices in Kentucky. Kentucky teachers were 
surveyed on their classroom practices as well as on other school practices during 
the 1996-97 school year and during the 1997-98 school year. Also during this time 
period, case studies of a small group of exemplary teachers were conducted. This 
paper is based on the survey results of the second year of that effort. Results of 
the case studies and the first-year survey are presented elsewhere (Borko & 
Elliott, 1999; Wolf & McIver, 1999; Stecher, Barron, Kaganoff, & Goodwin, 1998). 

The 1997-98 RAND/CRESST survey of Kentucky teachers involved writing 
and math teachers from grades 4-7. We selected these subjects because of their 
importance in the Kentucky education reform and because both were assessed 
using portfolios, one of the more innovative components of KIRIS. Grades 4-7 
were selected in order to obtain responses from teachers in accountability grades 
(grades 4 and 7 for writing, grade 5 for math) as well as in a non-accountability 
grade (grade 6). This report summarizes the results of the survey that pertain to 
differences in practice related to the accountability burden in each grade. 

Sampling 

Kentucky schools were classified into two overlapping groups based on the 
grade levels that were taught in the school. All schools containing grades 4 and 5 
were included in the elementary school sample, and all schools containing grades 
6 and 7 were included in the middle school sample. Some schools (e.g., K-8 
schools) were included in both groups, whereas others (e.g., K-4 schools) were not 
included in either group. We were interested in obtaining information from 
teachers in both accountability and non-accountability grades about the pressure 
they feel to prepare their students for the KIRIS tests the students will be taking 
in the following grade. Grade 4 is not an accountability grade for math but grade 
5 is, so we wanted to sample fourth-grade teachers in schools that also contained 
fifth grade. Similarly, sixth grade is not an accountability grade in writing but 
seventh grade is, so we wanted to sample sixth-grade teachers in schools that also 
contained seventh grade. 

For each population, schools were divided into four strata of equal size 
based on average enrollment in the grades of interest. Schools with fewer than 20 
students in the accountability grade were excluded from the sampling frame, as 



were schools with recent changes in their service areas. Within each stratum a 
random sample of schools was chosen. Seventy-two schools were selected for the 
elementary school sample and eighty schools were selected for the middle 
school sample. No school was chosen for more than one sample. 
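
The stratified selection described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure the study actually ran: the function name `stratified_sample`, the equal allocation of schools across strata, the fixed random seed, and the `(school_id, enrollment)` data layout are all assumptions, and the exclusion of schools with recent service-area changes is left out.

```python
import random


def stratified_sample(schools, n_per_stratum, n_strata=4,
                      min_enrollment=20, seed=0):
    """Draw a stratified random sample of schools (hypothetical helper).

    schools: list of (school_id, enrollment) pairs, where enrollment is the
    average enrollment in the grades of interest.
    """
    # Drop schools below the enrollment cutoff, then order by enrollment.
    eligible = sorted(
        (s for s in schools if s[1] >= min_enrollment),
        key=lambda s: s[1],
    )
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    size = len(eligible)
    sample = []
    for k in range(n_strata):
        # Integer boundaries split the sorted list into near-equal strata.
        lo = k * size // n_strata
        hi = (k + 1) * size // n_strata
        stratum = eligible[lo:hi]
        # Assumed equal allocation: the same number of schools per stratum.
        sample.extend(rng.sample(stratum, min(n_per_stratum, len(stratum))))
    return sample
```

Under the equal-allocation assumption, calling `stratified_sample(schools, n_per_stratum=18)` would yield 72 schools, the size reported for the elementary sample; the report does not say how schools were actually allocated across strata.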

A letter was sent to the principal of each school at the beginning of 1998 
explaining the study and requesting the names of the instructors teaching the 
identified grades and subjects. Principals were subsequently contacted by 
telephone to retrieve these names. Ninety-three percent of the principals in the 
sampled schools provided the requested information. 

The teachers were contacted by mail and asked to participate in the study. 
The contact letter explained the study and asked for their participation. Enclosed 
with the request were a letter from the Department of Education urging teacher 
cooperation, a copy of the survey to be completed, a return envelope, and a 
prepaid $10 long distance phone card. Teachers could keep the phone card whether 
or not they returned the survey. 

Four hundred and seventy-nine teachers completed the survey for an 
overall response rate of 54 percent. This was lower than the response rate 
achieved in previous RAND surveys of Kentucky teachers. Several explanations 
for the lower response rate are plausible. First, the survey was mailed to teachers 
near KIRIS testing time, so some teachers may have felt they did not have the 
time to complete it. Second, there has been considerable research conducted in 
Kentucky and some teachers may have grown weary of survey requests. Third, 
when the survey was in the field, the state legislature was deciding to eliminate 
the KIRIS program and adopt a new accountability program. Thus teachers may 
have felt that with the KIRIS system on the way out, the survey results would be 
of little use. 

Survey Design and Data Collection 

Building on past RAND research, the surveys addressed a broad range of 
issues related to classroom practices. Major themes included professional 
development, school and class organization, curriculum and instruction, test 
preparation, and school level practices related to the accountability assessments. 

Most of the survey questions were presented in a closed format. 
Respondents were asked to provide numerical answers or to select one option 
from a predetermined set of options (e.g., three-, four-, and five-point Likert 



scales, and yes/no questions). We also asked a number of open-response 
questions that allowed respondents to explain or expand upon their answers to 
the closed-ended questions. 

Change was a predominant theme in the survey. For most questions about 
practice, teachers were asked about current behaviors (during the 1997-98 school 
year) and about changes during the past three-year period. Only teachers with at 
least three years of experience answered questions about changes in practice. 
Twelve percent of the elementary teachers and 18 percent of the middle school 
teachers indicated that they could not answer these questions. 

Analysis 

Because we were interested in comparing teachers' classroom practices across 
grades, only teachers who taught a single grade were included in the analyses 
reported here (N=365). 

All analyses were conducted using weighted data. The weight assigned to 
each case was the product of two inverse probabilities: the inverse of the 
probability that the school would be selected and the inverse of the probability 
that the sampled individual would participate (complete the survey). 
Descriptive statistics were calculated overall and 
separately for each grade. When data were combined across grades, the grades 
were weighted equally in the combined statistics. For the Likert questions, 
frequencies were computed. For questions requiring a numerical response, we 
calculated means, medians, and standard deviations. 
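
The weighting scheme just described can be illustrated with a short Python sketch. The function names (`case_weight`, `weighted_mean`, `equal_grade_average`) and the example probabilities are hypothetical and are not taken from the study; the sketch only shows the arithmetic of inverse-probability weights and of averaging grade-level statistics with equal weight per grade.

```python
def case_weight(p_school_selected, p_teacher_responded):
    """Weight = (1 / P(school selected)) * (1 / P(teacher responded))."""
    for p in (p_school_selected, p_teacher_responded):
        if not 0 < p <= 1:
            raise ValueError("probabilities must lie in (0, 1]")
    return (1.0 / p_school_selected) * (1.0 / p_teacher_responded)


def weighted_mean(values, weights):
    """Mean of values in which each case counts in proportion to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def equal_grade_average(stat_by_grade):
    """Combine per-grade statistics so that every grade counts equally."""
    return sum(stat_by_grade.values()) / len(stat_by_grade)
```

For instance, a teacher in a school sampled with probability 0.5, in a group where a quarter of sampled teachers responded, would receive a weight of 2 x 4 = 8 and so stand in for eight teachers in the population.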

We tested the significance of the differences between responses for teachers 
in different grades. The majority of statistical tests performed were chi-squared 
tests comparing responses across grades where the responses were dichotomized 
(e.g., no/yes, low/high, or less frequently/more frequently). For questions 
requiring a numerical response, one-way analyses of variance (ANOVA) were 
used to compare groups. 
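
As a rough illustration of the chi-squared comparison of dichotomized responses across grades, the following Python sketch computes the Pearson statistic for a table of counts. It is a simplified sketch under stated assumptions: it uses unweighted counts (the study used weighted data), applies no continuity correction, assumes every row and column total is positive, and the names `chi_squared`, `CRITICAL_05`, and `differs_at_05` are hypothetical. The counts in the usage note are invented for illustration.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom.

    table: list of rows of counts, e.g. one row per grade holding
    [n_no, n_yes] for a dichotomized survey item.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if response were independent of grade.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Upper 5 percent critical values of the chi-squared distribution.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def differs_at_05(table):
    """True if the statistic exceeds the 5 percent critical value."""
    stat, df = chi_squared(table)
    return stat > CRITICAL_05[df]
```

With two grades and invented counts `[[30, 10], [20, 20]]`, the expected counts are 25 and 15 in each row, the statistic is 16/3 (about 5.33) on one degree of freedom, and the difference would be flagged at the 5 percent level.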

There were a number of open-response questions on the survey. Responses 
to these questions were read by project staff, and codes were developed for all 
responses that occurred with any regularity. The responses were then coded and 
the codes tallied. Responses to these questions are used only for descriptive 
purposes; no significance tests were carried out. 



Results 



There were strong associations between the grade levels at which specific 
subjects were assessed in KIRIS and the educational practices of teachers in those 
grades. Differences related to KIRIS grade levels were found in teachers' 
participation in professional development, their allocation of instructional time 
across subjects (in self-contained classrooms), and the relative emphasis they 
placed on specific topics within the subjects of mathematics and writing. 

Table 1 shows the KIRIS testing grades by subject for 1997-98. These are the 
milepost subject-by-grade combinations that serve as the reference for 
comparisons among the survey results for teachers in different grades. 
Mathematics portfolios have been part of the KIRIS assessment system off and 
on. During the 1997-98 school year, they were not officially being collected or 
scored. However, many teachers did have their students compose mathematics 
portfolios in anticipation of their return to the accountability system in future 
years. 



Table 1 

Assessment Grade Levels for KIRIS, 1997-98 

                                                      Grade
Subject                                  Fourth   Fifth   Sixth   Seventh
Reading                                  O        -       -       O
Writing                                  P        -       -       P
Science                                  O        -       -       O
Mathematics                              -        OP*     -       -
Social Studies                           -        O       -       -
Arts and Humanities                      -        O       -       -
Practical living/Vocational education    -        O       -       -

Note. P designates cumulative portfolio assessments; O designates on-demand, open-response 
assessments. 

*KIRIS mathematics portfolios were not in use at fifth grade in 1997-98. However, because they 
were scheduled to be added to the KIRIS mathematics assessment at fifth grade in the future, 
many teachers were using them at least informally. 



